Abstract



We consider non-cooperative constrained stochastic games with N players with the following special structure.
With each player i there is an associated controlled Markov chain MDP_i. The transition probabilities
of the ith Markov chain depend only on the state and actions of controller i. The information structure that
we consider is such that each player knows the state of its own MDP and its own actions. It does not know
the states of, and the actions taken by, other players. Finally, each player wishes to minimize a time-average
cost function, and has constraints over other time-average cost functions. Both the cost that is minimized and
those defining the constraints depend on the states and actions of all players. We study in this paper the
existence of a Nash equilibrium. Examples in power control in wireless communications are given.


1 Introduction

Non-cooperative games deal with a situation of several decision makers (often called agents, users or players) where
the cost of each player may be a function not only of its own decision but also of the decisions of the other
players. Each player chooses its decision so as to minimize its own individual cost.

Non-cooperative games can also model sequential decision making by non-cooperating players, including
situations in which the parameters defining the game vary in time. The game is then said to be a dynamic
game and the parameters that may vary in time are the states of the game. At any given time (assumed to be
discrete) each player takes a decision (also called an action) according to some strategy. The vector of actions
chosen by players at a given time (called a multi-action) may determine not only the cost for each player at that
time; it can also determine the state evolution. Each player is interested in minimizing some function of all the
costs at different time instants. In particular, we shall consider here the expected time-average costs for the players.

We consider in this paper the class of stochastic decentralized games which we call "cost-coupled constrained
stochastic games" and which are characterized by the following:

1. We associate with each player a Markov chain, whose transition probabilities depend only on the action of that
player.

2. We assume that at any time, each player has information only on the current and past states of its own
Markov chain as well as on its previous actions. It does not know the states and actions of other players.

3. Each player has constraints on its strategies (to be defined later). We consider the general situation in which
the constraints for a player depend on the strategies used by other players.

4. There are cost functions (one per player) that depend on the states and actions of all players, and each player
wishes to minimize its own cost.

We see that players "interact" only through the last two points above. 

It is well known that identifying equilibrium policies (even in the absence of constraints) is hard. Unlike the situation
in Markov Decision Processes (MDPs), in which stationary optimal strategies are known to exist (under suitable
conditions), and unlike the situation in constrained MDPs (CMDPs) with a multichain structure, in which optimal
Markov policies exist [T31 HH], we know that equilibrium strategies in stochastic games need in general to depend
on the whole history (see e.g. [15] for the special case of zero-sum games). This difficulty has motivated researchers
to search for various possible structures of stochastic games in which saddle point policies exist among stationary
or Markov strategies and are easier to compute [11]. In line with this approach, we shall identify conditions under
which constrained equilibria exist for cost-coupled constrained stochastic games.

Related work. Several papers have already dealt with constrained stochastic games. In [7], the authors
established the existence of a constrained equilibrium in a context of centralized stochastic games, in which all
players jointly control a single Markov chain and in which all players have full information on its state. Moreover,
when taking a decision at time t, each player has information on all actions previously taken by all players.

The special cost-coupled structure (see Definition 2.1) has been investigated in [121 [2] in zero-sum games where
there is a single cost which one of the players wishes to minimize and which a second player wishes to maximize. A
highly non-stationary saddle-point was obtained in [32] for a zero-sum constrained stochastic game with expected
average costs.

Although the question of existence of an equilibrium in cost-coupled stochastic games has not been considered
before, some specific applications of such games have been formulated. Indeed, these games have been used
extensively by Huang, Malhame and Caines in a series of publications [THl [T7j. Although they have not established
the existence of a Nash equilibrium, they have been able to obtain an ε-Nash equilibrium for the case of a large
population of players. Models concerning uplink power control, similar to the one studied in [16], have been
investigated in [3], in which the structure of the constrained equilibrium is established. We note however that in the models
considered in [3], the local Markovian states of each user are not controlled; the decisions of each user have an
impact only on the costs and not on the transition probabilities.

2 The model and main result 

We consider a game with N players, labeled 1, ..., N. Define for each player i the tuple
$\{X_i, A_i, \mathcal{P}_i, c_i, V_i, \beta_i\}$ where

• $X_i$ is a finite local state space of the ith player. Generic notation for states will be $x, y$ or $x_i, y_i$. We let
$X := \prod_{i=1}^{N} X_i$ be the global state space, and we define $X_{-i} := \prod_{j \ne i} X_j$ to be the set of all
possible states of players other than i.

• $A_i$ is a finite set of actions. We denote by $A_i(x_i)$ the set of actions available for player i at state $x_i$. A generic
notation for a vector of actions will be $a = (a_1, \ldots, a_N)$ where $a_i$ stands for the action chosen by player i.

• Define the local set of state-action pairs for player i as $\mathcal{K}_i = \{(x_i, a_i) : x_i \in X_i,\ a_i \in A_i(x_i)\}$. Denote the
set of all global state-action pairs by $\mathcal{K} = \prod_{i=1}^{N} \mathcal{K}_i$ and let $\mathcal{K}_{-i} = \prod_{j \ne i} \mathcal{K}_j$ denote the set of state-action pairs
of all players other than i.

• $\mathcal{P}_i$ are the transition probabilities for player i; thus $\mathcal{P}^i_{x_i a_i y_i}$ is the probability that the state of player i moves
from $x_i$ to $y_i$ if she chooses action $a_i$.

• $c = \{c_i^j\}$, $i = 1, \ldots, N$, $j = 0, 1, \ldots, B_i$ is a set of immediate costs, where $c_i^j : \mathcal{K} \to \mathbb{R}$. Thus player i has a set of
$B_i + 1$ immediate costs; $c_i^0$ will correspond to the cost function that is to be minimized by that player, and
$c_i^j$, $j > 0$, will correspond to cost functions on which some constraints are imposed.

• $V = \{V_i^j\}$, $i = 1, \ldots, N$, $j = 1, \ldots, B_i$ are bounds defining the constraints (see (2) below).

• $\beta_i$ is a probability distribution for the initial state of the Markov chain of player i. The initial states of the
players are assumed to be independent.

Histories, information and policies. Let $M_1(G)$ denote the set of probability measures over a set G. Define
a history of player i at time (or of length) t to be a sequence of her previous states and actions, as well as her current
local state: $h_i^t = (x_i^1, a_i^1, \ldots, x_i^{t-1}, a_i^{t-1}, x_i^t)$ where $(x_i^s, a_i^s) \in \mathcal{K}_i$ for all $s = 1, \ldots, t$. Let $H_i^t$ be the set of all possible
histories of length t for player i. A policy (also called a strategy) $u_i$ for player i is a sequence $u_i = (u_i^1, u_i^2, \ldots)$ where
$u_i^t : H_i^t \to M_1(A_i)$ is a function that assigns to any history of length t a probability measure over the set of actions
of player i.

At time t, each player i chooses an action $a_i$, independently of the choice of actions of other players, with
probability $u_i^t(a_i | h_i^t)$ if the history $h_i^t$ was observed by player i. Denote $a = (a_1, \ldots, a_N)$.

The class of all policies defined as above for player i is denoted by $U_i$. The collection $U = \prod_{i=1}^{N} U_i$ is called the
class of multi-policies ($\prod$ stands for the product space).

Stationary policies. A stationary policy for player i is a function $u_i : X_i \to M_1(A_i)$ so that $u_i(\cdot | x_i) \in
M_1(A_i(x_i))$. We denote the class of stationary policies of player i by $U_i^S$. The set $U^S = \prod_{i=1}^{N} U_i^S$ is called the class
of stationary multi-policies. Under any stationary multi-policy u (where the $u_i$ are stationary for all the players),
at time t, the controllers, independently of each other, choose actions $a = (a_1, \ldots, a_N)$, where action $a_i$ is chosen by
player i with probability $u_i(a_i | x_i^t)$ if state $x_i^t$ was observed by player i at time t.

For $u \in U$ we use the standard notation $u_{-i}$ to denote the vector of policies $u_k$, $k \ne i$; moreover, for $v_i \in U_i$,
we define $[u_{-i} | v_i]$ to be the multi-policy where, for $k \ne i$, player k uses $u_k$, while player i uses $v_i$. Define
$U_{-i} := \bigcup_{u \in U} \{u_{-i}\}$.

A distribution $\beta$ for the initial state (at time 1) and a multi-policy u together define a probability measure
$P_\beta^u$ which determines the distribution of the vector stochastic process $\{X^t, A^t\}$ of states and actions, where
$X^t = \{X_i^t\}_{i=1,\ldots,N}$ and $A^t = \{A_i^t\}_{i=1,\ldots,N}$. The expectation that corresponds to an initial distribution $\beta$ and a
policy u is denoted by $E_\beta^u$.

Costs and constraints. For any multi-policy u and initial distribution $\beta$, the i,j-expected average cost is defined as

$$C^{i,j}(\beta, u) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} E_\beta^u\, c_i^j(X^t, A^t). \tag{1}$$

A multi-policy u is called i-feasible if it satisfies:

$$C^{i,j}(\beta, u) \le V_i^j, \quad \text{for all } j = 1, \ldots, B_i. \tag{2}$$

It is called feasible if it is i-feasible for all the players $i = 1, \ldots, N$. Let $U_V$ be the set of feasible policies.

Definition 2.1 (i) A multi-policy $u \in U_V$ is called a constrained Nash equilibrium if for each player $i = 1, \ldots, N$ and
for any $v_i$ such that $[u_{-i} | v_i]$ is i-feasible,

$$C^{i,0}(\beta, u) \le C^{i,0}(\beta, [u_{-i} | v_i]). \tag{3}$$

Thus, any deviation of any player i will either violate the constraints of the ith player, or if it does not, it will result
in a cost $C^{i,0}$ for that player that is not lower than the one achieved by the feasible multi-policy u.

(ii) For any multi-policy u, $u_i$ is called an optimal response for player i against $u_{-i}$ if u is i-feasible, and if for any
$v_i$ such that $[u_{-i} | v_i]$ is i-feasible, (3) holds.

(iii) A multi-policy v is called an optimal response against u if for every $i = 1, \ldots, N$, $v_i$ is an optimal response for
player i against $u_{-i}$.

Assumptions. We introduce the following assumptions.

• (Π1) Ergodicity: For each player i and for any stationary policy $u_i$ of that player, the state process of that
player is an irreducible Markov chain with one ergodic class (and possibly some transient states).

• (Π2) Strong Slater condition: There exists some real number $\eta > 0$ such that the following holds. Every player
i has some policy $v_i$ such that for any multi-strategy $u_{-i}$ of the other players,

$$C^{i,j}(\beta, [u_{-i} | v_i]) \le V_i^j - \eta, \quad \text{for all } j = 1, \ldots, B_i. \tag{4}$$

• (Π3) Information: The strategy chosen by any player does not depend on the realization of the cost.

The last assumption is frequently encountered in game theory and in applications, see e.g. [HJ [25]. The 
assumption is in fact directly implied by the definition of policies. If it were allowed to have policies depend on the 
realization of the cost, then a player could use the costs to estimate the state and actions of the other player. 

We are now ready to introduce the main result. 

Theorem 2.1 Assume that Π1 and Π2 hold. Then there exists a stationary multi-policy u which is a constrained
Nash equilibrium.

Remark 2.1 If assumption Π2 does not hold, the upper semi-continuity which is needed for proving the existence
of an equilibrium (see Proposition 3.1) need not hold. This is true even for the case of a single player, see ^j.

3 Proof of main result 

We begin by describing the way an optimal stationary response for player i is computed for a given stationary
multi-policy u. Fix a stationary policy $u_i$ for player i. With some abuse of notation, we denote for any $x_i \in X_i$
and any $y_i \in X_i$,

$$\mathcal{P}^i_{x_i u_i y_i} = \sum_{a_i \in A_i(x_i)} u_i(a_i | x_i)\, \mathcal{P}^i_{x_i a_i y_i}.$$
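The averaging over actions in the last display can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical two-state, two-action local chain; the function name `induced_transition_matrix` and all numbers are illustrative and do not come from the paper.

```python
from typing import List

Matrix = List[List[float]]


def induced_transition_matrix(P: List[Matrix], u: Matrix) -> Matrix:
    """Return P_u with P_u[x][y] = sum_a u[x][a] * P[x][a][y].

    P[x][a][y] is the probability that the local state moves from x to y
    under action a; u[x][a] is the probability that the stationary policy
    picks action a in state x.
    """
    n_states = len(P)
    return [
        [sum(u[x][a] * P[x][a][y] for a in range(len(P[x]))) for y in range(n_states)]
        for x in range(n_states)
    ]


# Hypothetical chain (illustrative numbers only).
P = [
    [[0.9, 0.1], [0.2, 0.8]],  # state 0: rows for action 0 and action 1
    [[0.5, 0.5], [0.1, 0.9]],  # state 1: rows for action 0 and action 1
]
u = [[0.5, 0.5], [1.0, 0.0]]   # mixes in state 0, always plays action 0 in state 1

P_u = induced_transition_matrix(P, u)
```

Each row of `P_u` is a convex combination of the action rows of the same state, so it is again a probability vector.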

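Denote the immediate costs induced by players other than i, when player i uses action $a_i$ and the other players
use a stationary multi-policy $u_{-i}$, by

$$c_i^{j,u}(x_i, a_i) := \sum_{(x,a)_{-i} \in \mathcal{K}_{-i}} \Big[ \prod_{l \ne i} u_l(a_l | x_l)\, \pi_l^{u_l}(x_l) \Big]\, c_i^j(x, a), \qquad a = [a_{-i} | a_i],\ x = [x_{-i} | x_i],$$

where $\pi_l^{u_l}$ is the steady state (invariant) probability of the state process of player l under the policy $u_l$.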
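For intuition, the induced cost can be evaluated directly in the two-player case. The sketch below assumes N = 2, a hypothetical cost function `cost`, and hypothetical values for the other player's steady state `pi2` and stationary policy `u2`; none of these come from the paper.

```python
from typing import Callable, List


def induced_cost(
    cost: Callable[[int, int, int, int], float],
    x1: int,
    a1: int,
    u2: List[List[float]],
    pi2: List[float],
) -> float:
    """Average cost(x1, a1, x2, a2) over player 2's steady state and policy.

    pi2[x2] is the invariant probability of state x2 for player 2, and
    u2[x2][a2] is the probability that player 2 plays a2 in state x2.
    """
    return sum(
        pi2[x2] * u2[x2][a2] * cost(x1, a1, x2, a2)
        for x2 in range(len(pi2))
        for a2 in range(len(u2[x2]))
    )


def cost(x1: int, a1: int, x2: int, a2: int) -> float:
    """Hypothetical immediate cost coupling the two players."""
    return a1 + x1 * a2


pi2 = [0.25, 0.75]               # assumed steady state of player 2
u2 = [[1.0, 0.0], [0.5, 0.5]]    # assumed stationary policy of player 2

c_state1 = induced_cost(cost, 1, 1, u2, pi2)
c_state0 = induced_cost(cost, 0, 1, u2, pi2)
```

Once the other player's policy is fixed, player 1 faces a cost that depends only on its own state and action, which is what reduces its problem to a single-controller constrained MDP.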



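Next we present a Linear Program (LP) for computing the set of all optimal responses for player i against a
stationary policy $u_{-i}$.

LP(i,u):

Find $z^*_{i,u} := \{z^*_{i,u}(y, a)\}_{y,a}$, where $(y, a) \in \mathcal{K}_i$, that minimizes

$$C_u^{i,0}(z_i) := \sum_{(y,a) \in \mathcal{K}_i} c_i^{0,u}(y, a)\, z_{i,u}(y, a) \quad \text{subject to:} \tag{5}$$

$$\sum_{(y,a) \in \mathcal{K}_i} z_{i,u}(y, a) \big[ \delta_r(y) - \mathcal{P}^i_{y a r} \big] = 0, \quad \forall r \in X_i, \tag{6}$$

$$C_u^{i,j}(z_i) := \sum_{(y,a) \in \mathcal{K}_i} c_i^{j,u}(y, a)\, z_{i,u}(y, a) \le V_i^j, \quad 1 \le j \le B_i, \tag{7}$$

$$z_{i,u}(y, a) \ge 0 \ \ \forall (y, a) \in \mathcal{K}_i, \qquad \sum_{(y,a) \in \mathcal{K}_i} z_{i,u}(y, a) = 1. \tag{8}$$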
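The constraints of the linear program can be checked mechanically for a candidate point. The sketch below does not solve the program; it only tests the balance, cost and normalization constraints for a given `z`, assuming a hypothetical two-state chain `P`, a hypothetical constraint cost `c1` and hypothetical bounds.

```python
from typing import List


def lp_constraints_hold(
    z: List[List[float]],
    P: List[List[List[float]]],
    costs: List[List[List[float]]],
    bounds: List[float],
    tol: float = 1e-9,
) -> bool:
    """Check the balance, cost-bound and normalization constraints for z[y][a]."""
    n = len(P)
    pairs = [(y, a) for y in range(n) for a in range(len(P[y]))]
    # Nonnegativity and total mass one.
    if any(z[y][a] < -tol for y, a in pairs):
        return False
    if abs(sum(z[y][a] for y, a in pairs) - 1.0) > tol:
        return False
    # Balance equations: mass leaving each state r equals mass entering it.
    for r in range(n):
        flow = sum(z[y][a] * ((1.0 if y == r else 0.0) - P[y][a][r]) for y, a in pairs)
        if abs(flow) > tol:
            return False
    # Constrained costs must stay below their bounds.
    for c, bound in zip(costs, bounds):
        if sum(c[y][a] * z[y][a] for y, a in pairs) > bound + tol:
            return False
    return True


# Hypothetical data (illustrative numbers only).
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
c1 = [[0.0, 1.0], [0.0, 1.0]]                  # constraint cost: 1 when action 1 is used
z_ok = [[5 / 19, 5 / 19], [9 / 19, 0.0]]       # satisfies the balance equations for P
z_unbalanced = [[0.25, 0.25], [0.25, 0.25]]    # does not satisfy them

feasible_loose = lp_constraints_hold(z_ok, P, [c1], [0.3])
feasible_tight = lp_constraints_hold(z_ok, P, [c1], [0.2])
feasible_unbalanced = lp_constraints_hold(z_unbalanced, P, [c1], [0.3])
```

Here `z_ok` spends a fraction 5/19 of the time on action 1, so it passes with the bound 0.3 and fails with the bound 0.2. An actual optimal response would additionally minimize the objective over all such points, for example with an off-the-shelf LP solver.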
Define $\Gamma(i, u)$ to be the set of optimal solutions of LP(i,u).

Given a set of nonnegative real numbers $z_i = \{z_i(y, a),\ (y, a) \in \mathcal{K}_i\}$, define the point to set mapping $\gamma(i, z_i)$
as follows: If $\sum_a z_i(y, a) \ne 0$ then $\gamma_y^a(i, z_i) := z_i(y, a) \big[ \sum_a z_i(y, a) \big]^{-1}$ is a singleton: for each y, we have that
$\gamma_y(i, z_i) = \{\gamma_y^a(i, z_i) : a \in A_i(y)\}$ is a point in $M_1(A_i(y))$. Otherwise, $\gamma_y(i, z_i) := M_1(A_i(y))$, i.e. the (convex and
compact) set of all probability measures over $A_i(y)$.

Define $g^i(z_i)$ to be the set of stationary policies for player i that choose, at state $y_i$, action a with probability
in $\gamma_{y_i}^a(i, z_i)$.
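The passage from a state-action measure to a stationary policy amounts to a row-wise normalization. The following minimal sketch returns one element of the set of policies just defined; the function name is illustrative, and the uniform choice for states of zero mass is an arbitrary selection made here, since any probability measure over the actions is allowed in that case.

```python
from typing import List


def policy_from_measure(z: List[List[float]]) -> List[List[float]]:
    """Return one stationary policy compatible with the measure z[y][a].

    States with positive mass get the normalized row of z. States with
    zero mass admit any action distribution; the uniform one is picked
    here purely for illustration.
    """
    policy = []
    for row in z:
        mass = sum(row)
        if mass > 0:
            policy.append([value / mass for value in row])
        else:
            policy.append([1.0 / len(row)] * len(row))
    return policy


# Hypothetical measures (illustrative numbers only).
w = policy_from_measure([[0.2, 0.2], [0.0, 0.6]])
w_zero_row = policy_from_measure([[0.0, 0.0], [0.5, 0.5]])
```

The set-valued definition matters only at states with zero mass; everywhere else the policy is determined uniquely by the measure.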

For any stationary multi-policy v define the occupation measures

$$f(\beta, v) := \{ f_i(\beta, v_i; y_i, a_i) : (y_i, a_i) \in \mathcal{K}_i,\ i = 1, \ldots, N \}$$

as follows. Let

$$f_i(v_i; y_i, a_i) := \pi_i^{v_i}(y_i)\, v_i(a_i | y_i),$$

where $\pi_i^{v_i}$ is the steady state (invariant) probability of the Markov chain describing the state process of player i,
when her policy is $v_i$. Note that a unique steady state probability exists by Assumption Π1 and it does not depend
on $\beta$. We thus often omit $\beta$ from the notation.
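As an illustration, the occupation measure of a single player can be computed from the transition probabilities and a stationary policy. The sketch below assumes a hypothetical two-state chain and approximates the invariant probability by iterating the lazy chain (I + P_v)/2, which has the same invariant probability and avoids periodicity issues; this numerical method is a choice made here, not one taken from the paper.

```python
from typing import List


def stationary_distribution(P_v: List[List[float]], iterations: int = 10_000) -> List[float]:
    """Approximate the invariant probability of P_v by iterating the lazy chain."""
    n = len(P_v)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [
            0.5 * pi[y] + 0.5 * sum(pi[x] * P_v[x][y] for x in range(n))
            for y in range(n)
        ]
    return pi


def occupation_measure(P: List[List[List[float]]], v: List[List[float]]) -> List[List[float]]:
    """Return f[y][a] = pi(y) * v[y][a] for the chain induced by the policy v."""
    n = len(P)
    P_v = [
        [sum(v[x][a] * P[x][a][y] for a in range(len(P[x]))) for y in range(n)]
        for x in range(n)
    ]
    pi = stationary_distribution(P_v)
    return [[pi[y] * v[y][a] for a in range(len(v[y]))] for y in range(n)]


# Hypothetical chain and policy (illustrative numbers only).
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
v = [[0.5, 0.5], [1.0, 0.0]]

f = occupation_measure(P, v)
```

For these numbers the induced chain has invariant probability (10/19, 9/19), so the occupation measure puts mass 5/19 on each action in state 0 and 9/19 on action 0 in state 1.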

Proposition 3.1 Assume Π1-Π3. Fix any stationary multi-policy u.

(i) If $z^*_{i,u}$ is an optimal solution for LP(i,u) then any element w in $g^i(z^*_{i,u})$ is an optimal stationary response of player i
against the stationary policy $u_{-i}$. Moreover, the multi-policy $v = [u_{-i} | w]$ satisfies $f_i(v) = z^*_{i,u}$ (it does not depend
on $\beta$).

(ii) Assume that w is an optimal stationary response of player i against the stationary policy $u_{-i}$, and let $v :=
[u_{-i} | w]$. Then $f_i(v)$ does not depend on $\beta$ and is optimal for LP(i,u).

(iii) The optimal sets $\Gamma(i, u)$, $i = 1, \ldots, N$ are convex, compact, and upper semi-continuous in $u_{-i}$, where u is
identified with points in $\prod_{i=1}^{N} \prod_{x_i \in X_i} M_1(A_i(x_i))$.

(iv) For each i, $g^i(z)$ is upper semi-continuous in z over the set of points which are feasible for LP(i,u) (i.e. the
points that satisfy constraints (6)-(8)).



Proof: When all players other than i use $u_{-i}$, then player i is faced with a constrained Markov decision process
(with a single controller). The proof of (i) and (ii) then follows from [5], Theorem 2.6. The first part of (iii) follows
from standard properties of Linear Programs, whereas the second part follows from an application of the theory
of sensitivity analysis of Linear Programs by Dantzig, Folkman and Shapiro [10] in [5], Theorem 3.6, to LP(i,u).
Finally, (iv) follows from the definition of $g^i(z)$. ■



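Define the point to set map

$$\Psi : \prod_{i=1}^{N} M_1(\mathcal{K}_i) \to 2^{\prod_{i=1}^{N} M_1(\mathcal{K}_i)}$$

by

$$\Psi(z) = \prod_{i=1}^{N} \Gamma(i, g(z)),$$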



where $z = (z_1, \ldots, z_N)$, each $z_i$ is interpreted as a point in $M_1(\mathcal{K}_i)$ and $g(z) = (g^1(z_1), \ldots, g^N(z_N))$.

Proof of Theorem 2.1: By Kakutani's fixed point theorem, a fixed point $z \in \Psi(z)$ exists. Proposition 3.1 (i)
implies that for any such fixed point, the stationary multi-policy $g = \{g^i(z_i) : i = 1, \ldots, N\}$ is a constrained Nash
equilibrium. ■
Remark 3.1 (i) The Linear Program formulation LP(i,u) is not only a tool for proving the existence of a
constrained Nash equilibrium; in fact, due to Proposition 3.1 (ii), it can be shown that any stationary constrained
Nash equilibrium w has the form $w = \{g^i(z_i) : i = 1, \ldots, N\}$ for some z which is a fixed point of $\Psi$.

(ii) It follows from [5], Theorems 2.4 and 2.5, that if $z = (z_1, \ldots, z_N)$ is a fixed point of $\Psi$, then any stationary
multi-policy g in $\prod_{i=1}^{N} g^i(z_i)$ satisfies $C^{i,j}(\beta, g) = C^{i,j}(z)$, $i = 1, \ldots, N$, $j = 0, \ldots, B_i$. Conversely, if w is a constrained
Nash equilibrium then

$$C^{i,j}(\beta, w) = \sum_{y \in X_i} \sum_{a \in A_i(y)} f_i(w; y, a)\, c_i^{j,w}(y, a)$$

(and f(w) is a fixed point of $\Psi$).